ABSTRACT

Workload  models  are  extremely  important  for  computer  performance  evaluation. 

The  problem  of  feature  reduction  for  the  purpose  of  the  formulation  of  workload 
models  has  received  widespread  attention.  This  paper  briefly  reviews  existing 
schemes  for  feature  selection  and  reduction,  and  proposes  a  feature  reduction 
algorithm  based  on  a  linear  decision-tree  classifier.  An  example  is  presented 
to  illustrate  the  use  and  validity  of  this  algorithm. 
I.  Introduction


It is well established that workload characterization is a
very important step in the performance evaluation of a
computer system [1,4]. The workload of a computer system
can be roughly defined as the set of all inputs (programs,
data, commands) the system receives from its environment [4].

In  forming  the  workload  model,  many  measurements  can  be  made 
on  this  set  of  inputs  through  accounting  logs,  software  or 
hardware  monitors.  Among  the  most  frequently  used  variables, 
for  example,  are  CPU  time,  core  requirements,  number  of 
disk/drum  requests,  the  length  of  I/O  requests,  the  number 
of  tape  I/Os,  etc. 

Quite often, the data collected are too voluminous to be
used directly. A procedure is required to extract or select
a subset of the data that can still effectively characterize
the workload. Hence, feature selection is an essential
procedure for building a workload model, for at least four
major reasons:


(i)  Some  of  the  monitoring  activities  are  time-consuming 
and  costly.  By  reducing  the  feature  set,  unnecessary 
monitoring  may  be  eliminated. 

(ii) Storage requirements are reduced, both in the
data-collecting phase and in the workload formulation
and testing phases.

(iii)  Computational  cost  is  reduced  in  the  designing  and 
the  testing  of  workload  models. 

(iv)  Reducing  the  number  of  features  will  reduce  the 
complexity  of  the  model. 


This report describes an algorithm for reducing the number
of features by way of a linear decision-tree classifier
which, in the worst case, performs as well as the linear
machine frequently used in pattern recognition.
II.  Background of Feature Selection

The problem of feature selection has received much attention
in the field of Pattern Recognition. Fu, Min and Li [5]
summarized the feature selection methods into four categories:

(i)  Information  theoretic  approach. 

This approach assumes that the data has a multivariate
Gaussian density, and uses the divergence between two
classes or a general separability measure as a criterion
for feature selection.


(ii)  Direct  estimation  of  error  probability. 

The underlying assumption for this method is that
the distribution density of the data is not known,
so small windows (Parzen-Window) [10] in the sample
space are used to estimate the probability density.
The probability of misclassification is used as
the criterion for reducing the number of features.

(iii) Feature-space transformation.

The Karhunen-Loève (K-L) expansion is used here,
which involves finding the eigenvalues and the
corresponding eigenvectors of the covariance matrix
of the sample data [6]. The eigenvectors represent
the orthogonal coordinates (axes) in the transformed
space, and the corresponding eigenvalues may be
seen as the variances with respect to the axes.
The feature reduction is then achieved by deleting
the axis in the new space that corresponds to the
smallest eigenvalue.

(iv) Stochastic automata approach.

A learning automaton is constructed as a feature
selection scheme, where the automaton is defined in
terms of training samples and a certain decision
rule. Each action made by the automaton corresponds
to choosing a certain subset of the feature set.
The automaton receives one penalty when an incorrect
action is taken. By minimizing the total number of
penalties, and then identifying the action that is
most frequently taken by the automaton, we obtain the
best feature subset.

While good results have been obtained from many of the feature
reduction techniques previously mentioned (especially for problems
in pattern recognition), much valuable information which is
essential for workload modelling is lost in the reduction process.

Agrawala and Mohr [2] discussed the feasibility of using
pattern recognition methods for the workload characterization
problem, and concluded that feature reduction by direct
re-clustering does not work well for workload characterization.

Mamrak and Amer [9] used a backward regression method in
which the probability of error of a particular feature subset
for describing the point densities is used as the criterion
for eliminating features. This procedure produces good
results only when the data used contains some features which
are relatively insignificant.

The eigenvalue analysis procedure was the first technique
used for this research. However, the results were
unsatisfactory for the following reasons: (1) a reduced feature
space is produced where each new feature is a linear
combination of the original features, so all of the original
features must still be measured, and (2) the new feature space
is not optimized for identifying clusters, but for retaining
the variance of the original space.
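Reason (1) can be seen in a small two-feature sketch. The code below is
only an illustration, not the procedure used in this research: the
function names and the four sample points are hypothetical, and the
closed-form eigen-decomposition is valid for the 2x2 case only.

```python
import math

def covariance_2d(points):
    """Population covariance terms (s_xx, s_xy, s_yy) of 2-feature samples."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx, sxy, syy

def kl_axes_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...], largest eigenvalue first."""
    sxx, sxy, syy = covariance_2d(points)
    if sxy == 0:
        # Features already uncorrelated: the axes are the features themselves.
        axes = [(sxx, (1.0, 0.0)), (syy, (0.0, 1.0))]
        return sorted(axes, key=lambda a: a[0], reverse=True)
    centre = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    axes = []
    for lam in (centre + radius, centre - radius):
        # (s_xy, lam - s_xx) solves (S - lam*I) v = 0 when s_xy != 0.
        vx, vy = sxy, lam - sxx
        norm = math.hypot(vx, vy)
        axes.append((lam, (vx / norm, vy / norm)))
    return axes

# Hypothetical samples in which the two features are perfectly correlated.
POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
AXES = kl_axes_2d(POINTS)
```

Here the axis that is kept has nonzero weight on both original
features, so dropping the small-eigenvalue axis does not remove the
need to monitor either of them.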

Hierarchical  clustering  [8]  and  multivariate  regression  were 
used  to  determine  if  some  relationship  exists  between  features, 
and  then  eliminate  features  by  substitution.  But  these  techniques 
only  work  when  there  exists  a  high  correlation  between  features, 
and  this  was  not  the  case  for  the  workload  data  considered  here. 

In  the  next  section,  a  procedure  for  reducing  the  feature 
set  for  workload  data  is  presented.  This  procedure  relies  on 
some  basic  concepts  of  pattern  recognition. 
III.  Feature Reduction by Decision-Tree

The  two  problems  most  frequently  encountered  in  feature 
reduction  are: 

1)  The  underlying  distribution  density  for  each  class 
is  seldom  known. 

2) The selected subset of features produces an intolerable
error rate when reclassifying the sample data.

In the case where the underlying distribution densities
are not known, it is sometimes desirable to find out if the
classes of samples are linearly separable (i.e., if there
exists a set of linear discriminant functions that can separate
all classes in the sample space).

Consider the case where there are m classes $C_1, \ldots, C_m$ in
the sample set, and let $X^t = (x_1 \ldots x_p)$ denote a feature
vector, where $x_1 \ldots x_p$ are the p feature values. Then a
linear discriminant function for a given class is defined as

    $g_i(X) = w_i^t X + w_{i0}, \quad i = 1, \ldots, m$        (1)

where $w_i^t = (w_{i1}, \ldots, w_{ip})$ and $w_{i0}$ are weighting
coefficients such that, if $X \in C_i$, then

    $g_i(X) > g_j(X)$ for every $j \neq i$.        (2)

The problem of classification now becomes one of finding
the $w_i$ and $w_{i0}$ in (1) such that (2) is satisfied. After
finding each $w_i$ and $w_{i0}$ for all $i = 1, \ldots, m$, the
classification proceeds as follows: for any new data point
$Z = (z_1, \ldots, z_p)$, assign Z to class $C_i$ if
$g_i(Z) > g_j(Z)$ for every $j \neq i$, and leave Z undecided if
there are ties. This type of classifier is called the linear
machine [3]. Figure 1 gives an example of a 3-class
classification problem. The discriminant functions $g_1$, $g_2$
and $g_3$ are obtained from three training sets $C_1$, $C_2$ and
$C_3$. Three regions $R_1$, $R_2$ and $R_3$ are formed based on the
relationships between $g_1$, $g_2$ and $g_3$. Any new data point Z
that lies within $R_i$ would then have $g_i(Z) > g_j(Z)$ for
$j \neq i$, and hence be classified as in class i. Note that the
set of discriminant functions $\{g_1 \ldots g_m\}$ need not be
unique.

Figure 1.  A 3-class problem in linear-machine.
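The decision rule of the linear machine can be sketched as follows.
This is a minimal illustration only: the weights and biases are
hypothetical values chosen by hand, not coefficients fitted to any
data, and the function names are not taken from [3].

```python
from typing import Optional, Sequence

def discriminant(w: Sequence[float], w0: float, x: Sequence[float]) -> float:
    """g_i(X) = w_i^t X + w_i0, as in equation (1)."""
    return sum(wk * xk for wk, xk in zip(w, x)) + w0

def linear_machine(weights, biases, x) -> Optional[int]:
    """Return the 1-based class whose g_i(x) is strictly largest, or None on a tie."""
    scores = [discriminant(w, w0, x) for w, w0 in zip(weights, biases)]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] + 1 if len(winners) == 1 else None

# Hypothetical coefficients for a 3-class problem with 2 features.
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
BIASES = [0.0, 0.0, 0.0]
```

With these coefficients, `linear_machine(WEIGHTS, BIASES, (2.0, 1.0))`
selects class 1, while the point (1.0, 1.0) gives equal values of
$g_1$ and $g_2$ and is therefore left undecided.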

This linear machine can be further generalized into quadratic
or polynomial discriminant functions in order to obtain
better fits to the boundaries. However, the number of
coefficients and the computational complexity then increase
rapidly with the degree, so the linear discriminant functions
are preferred if the error rate is tolerable.

As for the second problem, deleting features from the
original feature set is equivalent to projecting the sample
space into a lower dimensional subspace without any
transformation, hence causing changes in the distance between
patterns as well as classes. So, unless the classes are already
well-separated in a particular subspace, feature reduction by
brute force is liable to introduce intolerable errors.

Instead of trying to find a subset of features to classify
all classes in one step, we concentrate on just classifying
a few classes (e.g., one or two) from the rest at each stage
of an iterative procedure. As a result of this procedure,
all classes may eventually be classified.

For example, Figure 2 illustrates a decision-tree classifier:
$g_1$ discriminates $C_1$ from the rest by letting

    $g_1(X) > 0$  if $X \in C_1$
    $g_1(X) < 0$  if $X \notin C_1$

Then $g_2$ separates $C_2$ and $C_3$ by letting

    $g_2(X) > 0$  if $X \in C_2$
    $g_2(X) < 0$  if $X \notin C_2$


Compared to the classifier in the previous example, we
see that the number of discriminant functions is reduced.
More importantly, since $g_1$ and $g_2$ are each only involved in
separating one class from the others, the chance of successfully
reducing features is considerably increased when the number
of features is high.
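The two-stage rule above can be sketched as a short function. The
coefficients are hypothetical and chosen so that each discriminant
uses a single feature (the other weight is zero), which is the
situation in which a feature can be dropped at that node.

```python
def g(w, w0, x):
    """Linear discriminant g(X) = w^t X + w0."""
    return sum(wk * xk for wk, xk in zip(w, x)) + w0

def tree_classify(x, g1, g2):
    """Two-stage decision tree for three classes.

    Stage 1: g1 > 0 separates class 1 from the rest.
    Stage 2: g2 > 0 separates class 2 from class 3.
    """
    w1, b1 = g1
    if g(w1, b1, x) > 0:
        return 1
    w2, b2 = g2
    return 2 if g(w2, b2, x) > 0 else 3

# Hypothetical discriminants: G1 looks only at feature 0, G2 only at feature 1.
G1 = ((1.0, 0.0), -5.0)
G2 = ((0.0, 1.0), -2.0)
```

For instance, `tree_classify((6.0, 0.0), G1, G2)` stops at the first
stage and returns class 1 without consulting the second feature.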
IV.  Algorithm for Formulating the Decision-Tree


A  procedure  for  setting  up  the  decision-tree  classifier 
and  hence  achieving  feature  reduction  is  presented  here. 

Step 1. Obtain the class information of sample data in
the scaled (normalized) space.

This procedure requires a labeled data set in
order to design and test the classifier.
Since no absolute standard usually exists
for labeling the data classes, a clustering
procedure on the data set with all features
used is often required.

Step 2. For each feature j in class $C_i$, find the mean
$M_{ij}$ and the standard deviation $SD_{ij}$ in the
normalized space.

Step 3. Use each feature to form a partition of the
set of classes $C = \{C_1, \ldots, C_m\}$.

First, measure the "distance" of class $C_i$
and $C_k$ on feature j by defining

    $D_j(C_i, C_k) = \frac{|M_{ij} - M_{kj}|}{SD_{ij} + SD_{kj}}, \quad i, k = 1, \ldots, m.$

The classes $C_i$ and $C_k$ are called "close"
to each other on feature j if

(i) $D_j(C_i, C_k) < t$ for some threshold t, or

(ii) there exists a class $C_l$ such that
$D_j(C_i, C_l) < t$ and $D_j(C_l, C_k) < t$

(i.e., closeness is treated as transitive). Now, form a
partition $P_j$ of $C = \{C_1, \ldots, C_m\}$ over feature j by
putting the close classes into the same subset of the
partition.

Step 4. Rank each feature by:

(i) the number of classes it can separate; then

(ii) the magnitude of distance $D_j(C_i, C_k)$ by
which it separates the classes.

Step 5. By taking the intersections of different
partitions $P_1, P_2, \ldots, P_p$, we hope to find one or
more combinations that completely partition
C into single classes.

The intersection $P_{a \cap b}$ of two partitions
$P_a$ and $P_b$ is defined as:

    $P_{a \cap b} = \{\, C_k C_l \cdots \mid C_k C_l \cdots \in P_a \text{ and } C_k C_l \cdots \in P_b \,\}$,

that is, classes stay grouped together only if they are
grouped together in both $P_a$ and $P_b$.

For example, if $P_1 = \{C_1C_2C_3,\ C_4\}$ and
$P_2 = \{C_1C_3,\ C_2C_4\}$, then
$P_{1 \cap 2} = P_1 \cap P_2 = \{C_1C_3,\ C_2,\ C_4\}$.

The result of Step 4 is used here for
selecting features. If the next ranking
feature does not provide any additional
refinement in the partition, then we may skip
this feature.

Step  6.  If  some  classes  are  not  separable  in  the 

above  step,  then  the  linear  machine  mentioned 
in  the  previous  section  is  used  on  the 
features  that  best  separate  these  classes 
(from  Steps  3  and  4)  in  order  to  derive  a 
linear  discriminant  function. 

Step 7. Form the decision-tree with the features or
linear discriminant functions selected. The
root node of the tree is the entire sample
set. Its successors are the partitions by
one of the features (or the linear discriminant
functions) selected in Step 5 or 6. Repeat this
branching process until the leaves of the tree are all
single classes.

Step 8. Test the decision-tree classifier with the
test data. If the error rate is greater
than desired, go back to Step 5.
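Steps 2 to 4 can be sketched as follows. This is an illustration under
stated assumptions rather than the program used in this study: the
population standard deviation is assumed for $SD_{ij}$, the absolute
difference of means is used in the distance, features are ranked by
criterion (i) only (number of subsets in the partition), and the
sample data and function names are hypothetical.

```python
from statistics import mean, pstdev

def class_stats(samples, labels, j):
    """Step 2: mean and standard deviation of feature j for each class."""
    by_class = {}
    for x, c in zip(samples, labels):
        by_class.setdefault(c, []).append(x[j])
    return {c: (mean(v), pstdev(v)) for c, v in by_class.items()}

def distance(stats, ci, ck):
    """Step 3: D_j(C_i, C_k) = |M_ij - M_kj| / (SD_ij + SD_kj)."""
    (mi, si), (mk, sk) = stats[ci], stats[ck]
    spread = si + sk
    if spread == 0:
        return 0.0 if mi == mk else float("inf")
    return abs(mi - mk) / spread

def partition(stats, t):
    """Step 3: merge classes with D < t, closing the relation transitively."""
    classes = sorted(stats)
    parent = {c: c for c in classes}

    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c

    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            if distance(stats, a, b) < t:
                parent[find(b)] = find(a)
    blocks = {}
    for c in classes:
        blocks.setdefault(find(c), []).append(c)
    return sorted(blocks.values())

def rank_features(samples, labels, t):
    """Step 4, criterion (i): features with more subsets rank first."""
    n_features = len(samples[0])
    parts = {j: partition(class_stats(samples, labels, j), t)
             for j in range(n_features)}
    order = sorted(parts, key=lambda j: len(parts[j]), reverse=True)
    return order, parts

# Hypothetical data: 3 classes, 2 features (0-based feature indices).
SAMPLES = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
           (50.0, 2.0), (51.0, 3.0), (52.0, 4.0),
           (1.0, 100.0), (2.0, 101.0), (3.0, 102.0)]
LABELS = [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

With threshold t = 1.0, feature 0 groups classes 1 and 3 together while
feature 1 separates all three classes, so feature 1 is ranked first.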

Note that there is no guarantee that the feature-reduction
algorithm presented will always work, for some data may be
truly non-feature-reducible. But viewing the linear machine as
a special case of a one-level decision-tree with all available
features, this algorithm will do at least as well as the linear
machine.

V.  Experiment  and  Result 

A sample data set of 1200 jobsteps was collected from
a DECsystem-10 at Wright-Patterson Air Force Base [7]. The 10
features of the workload are listed in Table 1.

As a first step, a clustering program was used to cluster
these 1200 jobsteps with all 10 features. Eight clusters
(classes) were produced that reasonably separated the 1200
data points. During the clustering, normalized feature values
were used to avoid having any large-scale feature(s) dominate
the clustering. This normalized scale was used throughout
this experiment. The centers of these 8 classes in scaled
space are listed in Table 2.
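The effect of scaling can be sketched as below. The scaling actually
used in this experiment is not specified here, so the zero-mean,
unit-deviation (z-score) scaling in this sketch is an assumption made
purely for illustration; the function name and sample values are
hypothetical.

```python
from statistics import mean, pstdev

def normalize(samples):
    """Scale each feature (column) to zero mean and unit standard deviation.

    Assumed scheme (z-score); constant features are only centered.
    """
    columns = list(zip(*samples))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds))
            for row in samples]

# Hypothetical jobsteps: feature 1 is two orders of magnitude larger than feature 0.
RAW = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
SCALED = normalize(RAW)
```

After scaling, both features span the same range, so neither dominates
a distance-based clustering.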

In Steps 2 to 4, the first 600 jobsteps were used to design
the classifier. The partitions of the set of all classes
$C = \{C_1, \ldots, C_8\}$ are listed in Table 3 along with the
feature ranking. $P_1$ is the partition produced by feature 1,
$P_2$ is the partition produced by feature 2, etc.

In Step 5, there are several ways of obtaining a partition
with all classes separated. For example:

    $P_1 \cap P_8 = \{C_1C_2C_5C_8,\ C_3,\ C_4,\ C_6,\ C_7\}$

Since the next ranking partition ($P_{10}$) does not provide
an additional refinement, we proceed to select the next highest
ranking partition ($P_2$):

    $P_{1 \cap 8} \cap P_2 = \{C_1C_5C_8,\ C_2,\ C_3,\ C_4,\ C_6,\ C_7\}$

Again, skip partitions $P_6$, $P_3$ and $P_5$ for the same reason
given for skipping $P_{10}$. Now, to continue this process we select
$P_4$, and obtain:

    $P_{1 \cap 8 \cap 2} \cap P_4 = \{C_1C_8,\ C_2,\ C_3,\ C_4,\ C_5,\ C_6,\ C_7\}$

Finally, we choose $P_9$ which yields:

    $P_{1 \cap 8 \cap 2 \cap 4} \cap P_9 = \{C_1,\ C_2,\ C_3,\ C_4,\ C_5,\ C_6,\ C_7,\ C_8\}$

Using  different  combinations,  we  also  may  have: 

p]nP2np^iP7nP8  =  cj,  cJT,  c c^} 
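The chain of intersections through $P_1$, $P_8$, $P_2$, $P_4$ and
$P_9$ can be checked mechanically. The sketch below assumes the
partitions as read from Table 3, with classes written as integers 1
to 8; the helper names are hypothetical.

```python
from functools import reduce
from itertools import product

def blocks(*groups):
    """Build a partition as a list of frozensets."""
    return [frozenset(g) for g in groups]

def intersect(pa, pb):
    """Step 5: keep classes together only if both partitions group them."""
    common = [a & b for a, b in product(pa, pb) if a & b]
    return sorted(common, key=min)

def refines(current, p):
    """True if intersecting with p splits at least one subset of current."""
    return len(intersect(current, p)) > len(current)

# Partitions as read from Table 3 (feature number -> partition).
TABLE3 = {
    1: blocks({1, 2, 4, 5, 7, 8}, {3}, {6}),
    2: blocks({2}, {1, 3, 4, 5, 6, 7, 8}),
    4: blocks({1, 2, 3, 4, 6, 7, 8}, {5}),
    8: blocks({1, 2, 3, 5, 8}, {4, 6}, {7}),
    9: blocks({1, 2, 4, 5, 6, 7}, {3, 8}),
    10: blocks({1, 2, 5, 6, 7, 8}, {3}, {4}),
}

P18 = intersect(TABLE3[1], TABLE3[8])
FINAL = reduce(intersect, (TABLE3[j] for j in (1, 8, 2, 4, 9)))
```

`P18` reproduces the first intersection above, `refines(P18, TABLE3[10])`
is false (which is why $P_{10}$ is skipped), and `FINAL` consists of
eight single-class subsets.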

The  decision-tree  we  obtained  is  depicted  in  Figure  3. 

Notice that the order in which the features appear in the
decision-tree does not necessarily correspond to the order of ranking.

The discriminant functions, shown as the thresholds on the
branches of the tree, were determined experimentally. For
example, the first threshold (i.e., for feature #2) was set
equal to 190 so as to minimize the error of classification
between $\{C_2\}$ and $\{C_1, C_3, C_4, C_5, C_6, C_7, C_8\}$. The set
of features is reduced from 10 to 6, namely features 1, 2, 4, 5,
7, and 8. The 600 design samples and the second 600 test samples
are re-classified according to this classifier, and the results
are listed in Table 4. The confusion matrix of re-classification
on the total 1200 samples is shown in Table 5. The data in
Tables 4 and 5 show that the feature reduction procedure
presented in this paper performed well on the workload data
under consideration (a total error rate of 3.08%, as summarized
in Section VI).
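One simple way to choose such a threshold experimentally is to try
every midpoint between neighbouring sample values and keep the one
with the fewest misclassifications. The sketch below is an assumed
procedure, not necessarily the one followed in this experiment; the
sample values are hypothetical and it requires at least two distinct
feature values.

```python
def best_threshold(values, is_target):
    """Return the cut on one feature that minimises misclassifications.

    A sample is predicted to be in the target class when value > cut.
    """
    pairs = sorted(zip(values, is_target))
    # Candidate cuts: midpoints between consecutive distinct values.
    candidates = [(a[0] + b[0]) / 2
                  for a, b in zip(pairs, pairs[1:]) if a[0] != b[0]]

    def errors(cut):
        return sum((v > cut) != t for v, t in pairs)

    return min(candidates, key=errors)

# Hypothetical scaled values of one feature; True marks the class to split off.
VALUES = [1.0, 8.0, 14.0, 19.0, 290.0, 305.0, 320.0]
TARGET = [False, False, False, False, True, True, True]
CUT = best_threshold(VALUES, TARGET)
```

Any cut between the two groups gives zero errors on these values; the
function returns the midpoint of the gap.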


VI.  Summary  and  Conclusion 


The formulation of a workload model is an essential
requirement for performance evaluation of a computer system. Often
the measurements made on the workload data through software
or hardware monitors can be quite voluminous. In order to
reduce the features and yet preserve as much information as
possible for classifying the workload classes, we have proposed
a decision-tree-classifier feature reduction algorithm. This
algorithm can be characterized as a linear machine in a
sequential decision-tree, in that each decision rule involves
only classifying one or a few classes at a time using a linear
combination of a subset of features. Viewing the linear machine
as a one-level all-feature decision-tree, we can see that this
algorithm will do at least as well as the linear machine.

Many  feature  ranking  and  selection  algorithms  have  been 
proposed,  but  here  we  use  a  rather  straightforward  method  to 
simplify  the  procedures  involved. 

We have reduced the set of features from 10 to 6 with a
total error rate of 3.08%. If one more feature (i.e., feature 4)
were deleted, then the error rate would be roughly doubled, but
still within 7%. Also, the decision-tree may not be optimal,
in that we did not exhaust all the combinations of the partitions
by features. Hence, this algorithm does have the potential for
achieving feature reduction with a tolerable rate of error.
Feature   Name

   1      CPU time (sec)
   2      Core size (K-words)
   3      Number of disk I/O's
   4      Teletype I/O (number of characters)
   5      Number of interaction counts
   6      Number of tape I/O counts
   7      Number of core increases
   8      Average core increase
   9      Number of core decreases
  10      Average core decrease

Table 1.  The Features
CLUSTER        f1         f2         f3         f4         f5

   1       -1.6756    -1.0978    -1.4413     -.6985     -.7196
   2       -2.3260   305.0797    -2.1604    -3.1165    -1.8684
   3       79.5851    13.5612    86.2258      .5837      .1518
   4        6.0211     8.5139     7.3398     4.5257     3.7703
   5        5.3334     -.5153      .4891    75.4322    79.6747
   6       41.0921     9.3341      .2677     1.2483    -1.8684
   7        1.3627    18.8346     5.7694     1.3184      .6205
   8        9.9732     7.9743     5.9525     -.9423     -.7925

CLUSTER        f6         f7         f8         f9         f10

   1        -.4985    -1.7256    -1.2986    -1.6697    -1.7356
   2        -.5330    -4.1164    -3.8799    -4.1962    -3.1861
   3        -.5330    62.0908     5.1289    35.7363    30.6155
   4        -.5330    14.6205    20.6923     -.2509    82.43
   5        -.5330     1.2942     4.3514    -2.9503    -3.0969
   6      199.0540     4.5673    36.0107    -2.1198    -3.1861
   7        -.5330    11.3585    93.3493    -1.8890      .0841
   8        -.5330    14.5811     2.9368    25.5050    12.9688

Table 2.  Cluster Centers in Scaled Space
Partition                                  Rank

P1  = {C1C2C4C5C7C8, C3, C6}                 1
P2  = {C2, C1C3C4C5C6C7C8}                   4
P3  = {C1C2C4C5C6C7C8, C3}                   6
P4  = {C1C2C3C4C6C7C8, C5}                   8
P5  = {C1C2C3C4C6C7C8, C5}                   7
P6  = {C1C2C3C4C5C7C8, C6}                   5
P7  = {C1C2C4C5C6C7C8, C3}                   9
P8  = {C1C2C3C5C8, C4C6, C7}                 2
P9  = {C1C2C4C5C6C7, C3C8}                  10
P10 = {C1C2C5C6C7C8, C3, C4}                 3

Table 3.  Partitions by features